1. Data Cleaning
Creating connection to the sqlite database and downloading fires dataset
# Connecting
conn <- dbConnect(SQLite(), 'raw_data/FPA_FOD_20170508.sqlite')
as.data.frame(dbListTables(conn))
# Making fires dataframe
fires <- tbl(conn, "Fires") %>% collect()
Selecting columns of interest
fires_small <- fires %>%
select(NWCG_REPORTING_AGENCY, SOURCE_REPORTING_UNIT_NAME, FIRE_NAME,
FIRE_YEAR, DISCOVERY_DATE, DISCOVERY_DOY, DISCOVERY_TIME, CONT_DATE,
CONT_DOY, CONT_TIME, STAT_CAUSE_CODE, STAT_CAUSE_DESCR, FIRE_SIZE,
FIRE_SIZE_CLASS, LATITUDE, LONGITUDE, OWNER_CODE, OWNER_DESCR, STATE,
COUNTY, FIPS_CODE, FIPS_NAME, Shape)
fires_small <- clean_names(fires_small)
Changing some columns to be factors
fires_small <- fires_small %>%
mutate(nwcg_reporting_agency = as.factor(nwcg_reporting_agency)) %>%
mutate(stat_cause_code = as.factor(stat_cause_code)) %>%
mutate(fire_size_class = as.factor(fire_size_class)) %>%
mutate(owner_descr = as.factor(owner_descr)) %>%
mutate(state = as.factor(state))
2. Creating some initial visualisations
Fires per year
fires_small %>%
group_by(fire_year) %>%
summarise(num_fires =n()) %>%
ggplot +
aes(x = fire_year, y = num_fires) +
geom_point() +
# geom_col(fill = "dark blue", col ="white") +
geom_smooth(method = "lm", se = FALSE, colour = "red")
`summarise()` ungrouping output (override with `.groups` argument)
Warning messages:
1: In `[<-.data.frame`(`*tmp*`, is_list, value = list(`23` = "<S3: blob>")) :
replacement element 1 has 1 row to replace 0 rows

It can be seen from the linear model smoother that there is a slight increase in wildfires over the recording period, but there is a lot of variation between years. There is almost a repeating pattern, with 4 peaks visible. Having looked at the historic weather for that date range, these peaks seem to coincide with recorded heatwaves in 2000, 2006 and 2011.
Fires per day
fires_small %>%
group_by(discovery_date) %>%
summarise(num_fires = n()) %>%
ggplot +
aes(x = discovery_date, y = num_fires) +
geom_line(col = "dark blue")
`summarise()` ungrouping output (override with `.groups` argument)

This shows a typical time series plot with a cyclic variation due to warmer weather in the summer time.
Fires per month
fires_small %>%
mutate(year_month = make_date(fire_year, discovery_moy)) %>%
group_by(year_month) %>%
summarise(num_fires = n()) %>%
ggplot +
aes(x = year_month, y = num_fires) +
geom_line(col = "dark blue")
`summarise()` ungrouping output (override with `.groups` argument)

Peaks are still shown to be occurring in the summer. The 2006 heatwave is especially visible.
Fires by day of year
fires_small %>%
group_by(discovery_doy) %>%
summarise(num_fires = n()) %>%
ggplot(aes(x = discovery_doy, y = num_fires)) +
geom_line(col = "dark blue")
`summarise()` ungrouping output (override with `.groups` argument)

There are peaks around day 60-110 and a big peak around 180.
Checking the data to see where the peak occurs
fires_small %>%
group_by(discovery_doy) %>%
summarise(num_fires = n()) %>%
arrange(desc(num_fires))
`summarise()` ungrouping output (override with `.groups` argument)
The 2 highest days of the year are 185 and 186, which correspond to Independence Day (4th July) in a normal year and a leap year respectively. So I imagine most of the extra fires (over double the normal amount) are caused by fireworks.
Fires by month of year
fires_small %>%
mutate(discovery_moy = (month(ymd(discovery_date), label = TRUE))) %>%
group_by(discovery_moy) %>%
summarise(num_fires = n()) %>%
ggplot(aes(x = discovery_moy, y = num_fires)) +
geom_col(fill = "dark blue", col = "white")
`summarise()` ungrouping output (override with `.groups` argument)

There are 2 definite peaks during the year. The March and April peak is probably due to the US “Spring Break”, when schools and universities are closed, so families are likely to be on vacation, possibly visiting National Parks. July and August are also the Summer Break for schools, with both families visiting parks and hot weather being likely causes of fire outbreaks.
Fires by cause
options(scipen = 999)
fires_small %>%
group_by(stat_cause_descr) %>%
summarise(num_fires = n()) %>%
ggplot +
aes(x = reorder(stat_cause_descr, num_fires), y = num_fires) +
geom_col(fill = "dark blue") +
coord_flip()
`summarise()` ungrouping output (override with `.groups` argument)

Fire avg size by cause
fires_small %>%
group_by(stat_cause_descr) %>%
summarise(avg_size = mean(fire_size)) %>%
ggplot +
aes(x = reorder(stat_cause_descr, avg_size), y = avg_size) +
geom_col(fill = "dark blue") +
coord_flip()
`summarise()` ungrouping output (override with `.groups` argument)

Avg burn time by cause
fires_small %>%
summarise(num_na = sum(is.na(cont_date)))
Literally half the data is missing for burn time, making it very difficult to do any meaningful analysis
Fires by size
fires_small %>%
group_by(fire_size_class) %>%
summarise(num_fires = n()) %>%
ggplot +
aes(x = fire_size_class, y = num_fires, fill = fire_size_class) +
geom_col() +
scale_fill_manual(values = c("red", "orange", "yellow", "green", "blue",
"purple", "black"),
name = "Fire Size Classification",
breaks = c("A", "B", "C", "D", "E", "F", "G"),
labels = c("A: < 1/4 acre", "B: 1/4 to 10 acres", "C: 10 to 100 acres",
"D: 100 to 300 acres", "E: 300 to 1000 acres",
"F: 1000 to 5000 acres", "G: More than 5000 acres"))
`summarise()` ungrouping output (override with `.groups` argument)

3. Geo Spatial Visualisations
The dataset has a cause of fire column so I shall now create some causation plots.
Getting list of fire causes
fires_states %>%
distinct(stat_cause_descr) %>%
arrange(stat_cause_descr)
Wildfires caused by Arson
cause("Arson")
`summarise()` ungrouping output (override with `.groups` argument)

Arson does seem more prevalent in the SE states of Mississippi, Georgia, Alabama and also the western state of California.
Wildfires caused by Campfire
cause("Campfire")
`summarise()` ungrouping output (override with `.groups` argument)

Campfires are the most prevalent in the Western states of Oregon, California and Arizona.
Wildfires caused by Children
cause("Children")
`summarise()` ungrouping output (override with `.groups` argument)

Fires caused by children are spread across the country, but the most prevalent states are California in the west, and Alabama, South Carolina and New Jersey in the east.
Wildfires caused by Debris Burning
cause("Debris Burning")
`summarise()` ungrouping output (override with `.groups` argument)

Fires by burning debris are mostly in the southern warmer states of Texas, Georgia and North Carolina.
Wildfires caused by Equipment Use
cause("Equipment Use")
`summarise()` ungrouping output (override with `.groups` argument)

Most of the fires caused by equipment seem to be in California.
Wildfires caused by Fireworks
cause("Fireworks")
`summarise()` ungrouping output (override with `.groups` argument)

Most of the fires caused by fireworks seem to be in the north of the country, primarily in South Dakota, Montana and Washington state.
Wildfires caused by Lightning
cause("Lightning")
`summarise()` ungrouping output (override with `.groups` argument)

Apart from a hotspot of lightning strikes in Florida, the vast majority of fires caused by lightning are in the west of the country, with the 3 most affected states being California, Oregon and Arizona.
Wildfires caused by Miscellaneous
cause("Miscellaneous")
`summarise()` ungrouping output (override with `.groups` argument)

There seem to be quite a few miscellaneous classifications in California, Texas and New York.
Wildfires caused by Missing/Undefined
cause("Missing/Undefined")
`summarise()` ungrouping output (override with `.groups` argument)

The states with the most missing or undefined data are North and South Carolina, Oklahoma and California.
Wildfires caused by Powerline
cause("Powerline")
`summarise()` ungrouping output (override with `.groups` argument)

Texas has the largest amount of wildfires caused by powerlines. This is likely due to the warm climate and the large proportion of the state that is dry grasslands used for agriculture.
Wildfires caused by Railroad
cause("Railroad")
`summarise()` ungrouping output (override with `.groups` argument)

By far Florida has the most wildfires caused by railroads.
Wildfires caused by Smoking
cause("Smoking")
`summarise()` ungrouping output (override with `.groups` argument)

Fires caused by smoking seem to be spread around the country, but mainly on the east and west coasts.
Wildfires caused by Structure
cause("Structure")
`summarise()` ungrouping output (override with `.groups` argument)

South Dakota has the largest proportion of fires caused by structures.
Unsurprisingly the southern states seem to have more occurrences of wildfire in general, no doubt due to the warmer climate at their latitudes. Also the 1st and 3rd states with the highest number of fires are also among the largest States by size. However the 2nd highest State is Georgia, which, although it is in the south of the country, is only an average-sized State. Therefore, to get a better picture of what is going on, I’m going to look at the number of fires per square mile, normalising by State area.
The datasets package also includes the area in square miles of each state in the state.area vector.
state.area
[1] 51609 589757 113909 53104 158693 104247 5009 2057 58560 58876 6450
[12] 83557 56400 36291 56290 82264 40395 48523 33215 10577 8257 58216
[23] 84068 47716 69686 147138 77227 110540 9304 7836 121666 49576 52586
[34] 70665 41222 69919 96981 45333 1214 31055 77047 42244 267339 84916
[45] 9609 40815 68192 24181 56154 97914
length(state.area)
[1] 50
Annoyingly it also only has 50 entries, not the 52 needed, so I will need to add DC and PR back in.
(Area figures obtained from Wikipedia)
DC = 68 square miles, PR = 3,515 square miles
# To make my life easier I'm going to remove my modified copies of state.abb and state.name (which restores the originals from the datasets package) and make the tibble again, adding in the area figures at the same time to make sure they are in the correct order.
rm(state.abb)
rm(state.name)
state.abb <- append(state.abb, c("DC", "PR"))
state.name <- append(state.name, c("District of Columbia", "Puerto Rico"))
# Appending numbers rather than strings keeps state.area numeric
state.area <- append(state.area, c(68, 3515))
state_list <- tibble(state = state.abb, region = tolower(state.name), area = state.area)
# Re-joining tibbles
fires_states <- fires_small %>%
left_join(state_list, by = "state")
Normalising States area sizes
fires_states %>%
select(region, area) %>%
group_by(region, area) %>%
summarise(num_fires = n()) %>%
mutate(fires_sqmile = num_fires / area) %>%
arrange(desc(fires_sqmile))
`summarise()` regrouping output by 'region' (override with `.groups` argument)
This table shows Puerto Rico has the highest number of fires relative to its size, followed by New Jersey in the NE of the country and then by the States in the SE of the country.
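
The effect of normalising by area can be sketched with a couple of made-up counts. This is only an illustration: the fire counts in `toy_counts` are invented, while the areas come from `datasets::state.area`. Two states with the same number of fires end up with very different fires-per-square-mile values.

```{r}
# Areas (square miles) from the datasets package, named by state abbreviation
areas <- setNames(datasets::state.area, datasets::state.abb)

# Invented fire counts, for illustration only
toy_counts <- c(TX = 1000, NJ = 1000)

# Fires per square mile: same count, very different density
fires_sqmile <- toy_counts / areas[names(toy_counts)]
fires_sqmile
```

With equal counts the smaller state (New Jersey) has the higher density, which is the kind of shift seen in the table above.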
fires_states %>%
select(region, area) %>%
group_by(region, area) %>%
summarise(num_fires = n()) %>%
mutate(fires_sqmile = num_fires / area) %>%
right_join(state_map, by = "region") %>%
ggplot +
(aes(x = long, y = lat, group = group, fill = fires_sqmile)) +
geom_polygon() +
geom_path(color = "white") +
scale_fill_distiller(name = "Fire per Sq Mile", palette = "PuBuGn") +
#scale_fill_continuous(low = "darkblue",
# high = "darkred",
# name = "Fire per Sq mile") +
theme_map() +
coord_map("mollweide") +
ggtitle(paste0("Total US Wildfires per Square Mile from 1992-2015")) +
theme(plot.title = element_text(hjust = 0.5))
`summarise()` regrouping output by 'region' (override with `.groups` argument)

Puerto Rico is not shown on this map (nor are Alaska and Hawaii, as the map data only covers the contiguous states and DC), but visually the south eastern states still have the highest proportion of wildfires. Interestingly New Jersey also shows up as a hotspot in the NE of the country.
Do causes change over time?
Splitting causes into 2 groups for legibility.
The first group is for fires directly caused by people.
fires_states %>%
select(stat_cause_descr, fire_year) %>%
group_by(fire_year, stat_cause_descr) %>%
filter(stat_cause_descr == "Arson" | stat_cause_descr == "Campfire" |
stat_cause_descr == "Children" | stat_cause_descr == "Equipment Use" |
stat_cause_descr == "Fireworks" | stat_cause_descr == "Smoking") %>%
summarise(num_fires = n()) %>%
ggplot +
aes(x = fire_year, y = num_fires, colour = stat_cause_descr) +
geom_line()
`summarise()` regrouping output by 'fire_year' (override with `.groups` argument)
Warning message:
In `[<-.data.frame`(`*tmp*`, is_list, value = list(`23` = "<S3: blob>")) :
replacement element 1 has 1 row to replace 0 rows

The 2 large peaks in Arson in 1999 and 2006 are obvious. There was a large heatwave in 2006, but I’m not sure why this would result in an increase in arson, unless it was just due to the dry ground creating extra fuel to aid the spread of fires that would not normally have grown into large scale fires. This may also be the reason for the peak in 2006 for Equipment Use. Arson does, however, look to be decreasing since 2006.
And this one for the remaining causes, including naturally occurring fires.
fires_states %>%
select(stat_cause_descr, fire_year) %>%
group_by(fire_year, stat_cause_descr) %>%
filter(stat_cause_descr == "Debris Burning" | stat_cause_descr == "Lightning" |
stat_cause_descr == "Miscellaneous" | stat_cause_descr ==
"Missing/Undefined" | stat_cause_descr == "Powerline" |
stat_cause_descr == "Railroad" | stat_cause_descr == "Structure") %>%
summarise(num_fires = n()) %>%
ggplot +
aes(x = fire_year, y = num_fires, colour = stat_cause_descr) +
geom_line()
`summarise()` regrouping output by 'fire_year' (override with `.groups` argument)

Similar peaks can be seen in debris burning, miscellaneous and lightning in the heatwave of 2006 that left the ground very dry. There are peaks from 1997 to 2003 in debris burning, miscellaneous and lightning, but also a trough in missing/undefined, so this is likely due to more accurate classification of fires, with the missing/undefined category being used less.
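
The long chains of `==` and `|` used to pick out each group of causes could be written more compactly with `%in%`. A minimal sketch, assuming the same cause labels as above; `toy_causes` and `human_causes` are hypothetical names and the rows are invented:

```{r}
library(dplyr)

# Cause labels for the first (human-caused) group, as used in the filter above
human_causes <- c("Arson", "Campfire", "Children",
                  "Equipment Use", "Fireworks", "Smoking")

# Invented rows standing in for fires_states
toy_causes <- tibble(
  fire_year        = c(1999, 1999, 2006, 2006),
  stat_cause_descr = c("Arson", "Lightning", "Arson", "Smoking")
)

# Keep only the human-caused rows, then count per year and cause
human_counts <- toy_causes %>%
  filter(stat_cause_descr %in% human_causes) %>%
  count(fire_year, stat_cause_descr, name = "num_fires")

human_counts
```

The second group could then be selected with `!stat_cause_descr %in% human_causes`, so the two plots cannot accidentally overlap or miss a cause.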
Difference in causes between states
state_map_southern <- state_map %>%
filter(region == "florida" | region == "georgia" | region == "alabama" |
region == "mississippi" | region == "south carolina" |
region == "north carolina" | region == "tennessee" |
region == "arkansas" | region == "louisiana")
fires_states %>%
filter(fire_year == "1992" | fire_year == "1993" | fire_year == "1994" |
fire_year == "1995") %>%
filter(region == "florida" | region == "georgia" | region == "alabama" |
region == "mississippi" | region == "south carolina" |
region == "north carolina" | region == "tennessee" |
region == "arkansas" | region == "louisiana") %>%
select(region, stat_cause_descr) %>%
group_by(region, stat_cause_descr) %>%
summarise(num_fire = n()) %>%
top_n(1) %>%
right_join(state_map_southern, by = "region") %>%
ggplot +
(aes(x = long, y = lat, group = group, fill = stat_cause_descr)) +
geom_polygon() +
geom_path(color = "white") +
theme_map() +
scale_fill_brewer(name = "Cause of Fires", palette = "PuBuGn") +
ggtitle("Total US Wildfires main cause from 1992-1995") +
theme(plot.title = element_text(hjust = 0.5))
`summarise()` regrouping output by 'region' (override with `.groups` argument)
Selecting by num_fire

fires_states %>%
filter(fire_year == "1996" | fire_year == "1997" | fire_year == "1998" |
fire_year == "1999") %>%
filter(region == "florida" | region == "georgia" | region == "alabama" |
region == "mississippi" | region == "south carolina" |
region == "north carolina" | region == "tennessee" |
region == "arkansas" | region == "louisiana") %>%
select(region, stat_cause_descr) %>%
group_by(region, stat_cause_descr) %>%
summarise(num_fire = n()) %>%
top_n(1) %>%
right_join(state_map_southern, by = "region") %>%
ggplot +
(aes(x = long, y = lat, group = group, fill = stat_cause_descr)) +
geom_polygon() +
geom_path(color = "white") +
theme_map() +
scale_fill_brewer(name = "Cause of Fires", palette = "PuBuGn") +
ggtitle("Total US Wildfires main cause from 1996-1999") +
theme(plot.title = element_text(hjust = 0.5))
`summarise()` regrouping output by 'region' (override with `.groups` argument)
Selecting by num_fire

fires_states %>%
filter(fire_year == "2000" | fire_year == "2001" | fire_year == "2002" |
fire_year == "2003") %>%
filter(region == "florida" | region == "georgia" | region == "alabama" |
region == "mississippi" | region == "south carolina" |
region == "north carolina" | region == "tennessee" |
region == "arkansas" | region == "louisiana") %>%
select(region, stat_cause_descr) %>%
group_by(region, stat_cause_descr) %>%
summarise(num_fire = n()) %>%
top_n(1) %>%
right_join(state_map_southern, by = "region") %>%
ggplot +
(aes(x = long, y = lat, group = group, fill = stat_cause_descr)) +
geom_polygon() +
geom_path(color = "white") +
theme_map() +
scale_fill_brewer(name = "Cause of Fires", palette = "PuBuGn") +
ggtitle("Total US Wildfires main cause from 2000-2003") +
theme(plot.title = element_text(hjust = 0.5))
`summarise()` regrouping output by 'region' (override with `.groups` argument)
Selecting by num_fire

fires_states %>%
filter(fire_year == "2004" | fire_year == "2005" | fire_year == "2006" |
fire_year == "2007") %>%
filter(region == "florida" | region == "georgia" | region == "alabama" |
region == "mississippi" | region == "south carolina" |
region == "north carolina" | region == "tennessee" |
region == "arkansas" | region == "louisiana") %>%
select(region, stat_cause_descr) %>%
group_by(region, stat_cause_descr) %>%
summarise(num_fire = n()) %>%
top_n(1) %>%
right_join(state_map_southern, by = "region") %>%
ggplot +
(aes(x = long, y = lat, group = group, fill = stat_cause_descr)) +
geom_polygon() +
geom_path(color = "white") +
theme_map() +
scale_fill_brewer(name = "Cause of Fires", palette = "PuBuGn") +
ggtitle("Total US Wildfires main cause from 2004-2007") +
theme(plot.title = element_text(hjust = 0.5))
`summarise()` regrouping output by 'region' (override with `.groups` argument)
Selecting by num_fire

fires_states %>%
filter(fire_year == "2008" | fire_year == "2009" | fire_year == "2010" |
fire_year == "2011") %>%
filter(region == "florida" | region == "georgia" | region == "alabama" |
region == "mississippi" | region == "south carolina" |
region == "north carolina" | region == "tennessee" |
region == "arkansas" | region == "louisiana") %>%
select(region, stat_cause_descr) %>%
group_by(region, stat_cause_descr) %>%
summarise(num_fire = n()) %>%
top_n(1) %>%
right_join(state_map_southern, by = "region") %>%
ggplot +
(aes(x = long, y = lat, group = group, fill = stat_cause_descr)) +
geom_polygon() +
geom_path(color = "white") +
theme_map() +
scale_fill_brewer(name = "Cause of Fires", palette = "PuBuGn") +
ggtitle("Total US Wildfires main cause from 2008-2011") +
theme(plot.title = element_text(hjust = 0.5))
`summarise()` regrouping output by 'region' (override with `.groups` argument)
Selecting by num_fire

fires_states %>%
filter(fire_year == "2012" | fire_year == "2013" | fire_year == "2014" |
fire_year == "2015") %>%
filter(region == "florida" | region == "georgia" | region == "alabama" |
region == "mississippi" | region == "south carolina" |
region == "north carolina" | region == "tennessee" |
region == "arkansas" | region == "louisiana") %>%
select(region, stat_cause_descr) %>%
group_by(region, stat_cause_descr) %>%
summarise(num_fire = n()) %>%
top_n(1) %>%
right_join(state_map_southern, by = "region") %>%
ggplot +
(aes(x = long, y = lat, group = group, fill = stat_cause_descr)) +
geom_polygon() +
geom_path(color = "white") +
theme_map() +
scale_fill_brewer(name = "Cause of Fires", palette = "PuBuGn") +
ggtitle("Total US Wildfires main cause from 2012-2015") +
theme(plot.title = element_text(hjust = 0.5))
`summarise()` regrouping output by 'region' (override with `.groups` argument)
Selecting by num_fire

Looking at these trends some interesting insights can be seen. For the combined years data Florida stands out as having railroad as its main cause of wildfire, but from the above plots it can be seen that railroad fires are only the main cause up to the 4-year period ending in 2003, after which the main cause changes to lightning until the end of the collection period in 2015. Similarly arson seems reasonably common in the southern states until 2007, after which it no longer appears as the most common cause of wildfire. This downward trend was also noted earlier in the overall causation plots for all states.
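
The six near-identical blocks above differ only in the years they filter on. One possible way to avoid the repetition is to bin `fire_year` into 4-year periods with `cut()` and find the main cause per period in a single pipeline, which could then feed one faceted plot. A minimal sketch with invented rows (`toy_fires` and `period_labels` are hypothetical names, not part of the original analysis):

```{r}
library(dplyr)

period_labels <- c("1992-1995", "1996-1999", "2000-2003",
                   "2004-2007", "2008-2011", "2012-2015")

# Invented rows standing in for fires_states
toy_fires <- tibble(
  fire_year        = c(1993, 1994, 1995, 2013, 2014, 2015),
  region           = "florida",
  stat_cause_descr = c("Railroad", "Railroad", "Lightning",
                       "Lightning", "Lightning", "Arson")
)

main_cause <- toy_fires %>%
  # Breaks at 1991, 1995, ..., 2015 give the intervals (1991, 1995], (1995, 1999], ...
  mutate(period = cut(fire_year, breaks = seq(1991, 2015, by = 4),
                      labels = period_labels)) %>%
  count(period, region, stat_cause_descr, name = "num_fires") %>%
  group_by(period, region) %>%
  # Keep the most frequent cause in each period and region
  slice_max(num_fires, n = 1) %>%
  ungroup()

main_cause
```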
---
title: "R Notebook"
output: html_notebook
---

```{r}
library(tidyverse)
library(RSQLite)
library(dbplyr)
library(janitor)
library(lubridate)
library(datasets)
library(ggthemes)
library(gganimate)
```


# 1.  Data Cleaning


####  Creating connection to the sqlite database and downloading fires dataset

```{r}
# Connecting

conn <- dbConnect(SQLite(), "raw_data/FPA_FOD_20170508.sqlite")
```

```{r}
# Pulling all the names of the tables in the database file

as.data.frame(dbListTables(conn))
```

```{r}
# Making fires dataframe

fires <- tbl(conn, "Fires") %>% collect()
```


#### Seeing what other useful information is in the database.  The majority are part of the database structure and are not readable in R.

```{r}
# EPSG worldwide geodetic parameter dataset system
spatial_ref <- tbl(conn, "spatial_ref_sys_all") %>% collect()

# National Wildfire Coordinating Group unit abbreviations 
NWGG <- tbl(conn, "NWCG_UnitIDActive_20170109") %>% collect()
```


```{r}
# Disconnect

dbDisconnect(conn)
```


### Selecting columns of interest

```{r}
fires_small <- fires %>%
  select(NWCG_REPORTING_AGENCY, SOURCE_REPORTING_UNIT_NAME, FIRE_NAME,
         FIRE_YEAR, DISCOVERY_DATE, DISCOVERY_DOY, DISCOVERY_TIME, CONT_DATE,
         CONT_DOY, CONT_TIME, STAT_CAUSE_CODE, STAT_CAUSE_DESCR, FIRE_SIZE, 
         FIRE_SIZE_CLASS, LATITUDE, LONGITUDE, OWNER_CODE, OWNER_DESCR, STATE, 
         COUNTY, FIPS_CODE, FIPS_NAME, Shape)

fires_small <- clean_names(fires_small)
```


### Changing some columns to be factors

```{r}
fires_small <- fires_small %>%
  mutate(nwcg_reporting_agency = as.factor(nwcg_reporting_agency)) %>%
  mutate(stat_cause_code = as.factor(stat_cause_code)) %>%
  mutate(fire_size_class = as.factor(fire_size_class)) %>%
  mutate(owner_descr = as.factor(owner_descr)) %>%
  mutate(state = as.factor(state)) 
```


### Date is in Julian format, so overwriting with Gregorian format using year and day of year columns.  Also adding in a 'month of year column' for future use.

```{r}
fires_small <- fires_small %>%
  mutate(date_origin = as.Date(paste0(fire_year, "-01-01"))) %>%
  # Day-of-year 1 is 1 January itself, so subtract 1 before adding to the origin
  mutate(discovery_date = as.Date(discovery_doy - 1, origin = date_origin)) %>%
  mutate(discovery_moy = month(discovery_date)) %>%
  select(-date_origin)
```
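
A quick way to check a day-of-year conversion like this is to run it on a few dates where the answer is known. A minimal sketch with invented rows (`toy_dates` is a hypothetical name); it assumes `discovery_doy` counts 1 January as day 1, so the offset from the origin is `discovery_doy - 1`:

```{r}
# Invented rows: day 186 of a leap year, day 185 of a normal year, and day 1
toy_dates <- data.frame(
  fire_year     = c(2004L, 2005L, 2005L),
  discovery_doy = c(186L, 185L, 1L)
)

# 1 January of each fire year is the origin
date_origin <- as.Date(paste0(toy_dates$fire_year, "-01-01"))

# Offset is doy - 1 so that day 1 lands on the origin itself
toy_dates$discovery_date <- as.Date(toy_dates$discovery_doy - 1, origin = date_origin)

toy_dates
```

Both July rows should come back as 4 July and the last row as 1 January; without the `- 1` every date would be shifted one day later.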



# 2. Creating some initial visualisations


### Fires per year

```{r}
fires_small %>%
  group_by(fire_year) %>%
  summarise(num_fires =n()) %>%
  ggplot +
  aes(x = fire_year, y = num_fires) +
  geom_point() +
  # geom_col(fill = "dark blue", col ="white") +
  geom_smooth(method = "lm", se = FALSE, colour = "red")
```
**It can be seen from the linear model smoother that there is a slight increase in wildfires over the recording period, but there is a lot of variation between years.  There is almost a repeating pattern, with 4 peaks visible.  Having looked at the historic weather for that date range, these peaks seem to coincide with recorded heatwaves in 2000, 2006 and 2011.**
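
To put a number on the "slight increase" rather than judging it by eye, the slope of the same kind of linear model can be pulled out with `lm()`. A minimal sketch using invented yearly counts (`toy_yearly` is a hypothetical name, not the real data):

```{r}
# Invented yearly counts, for illustration only
toy_yearly <- data.frame(
  fire_year = 1992:1995,
  num_fires = c(10, 12, 14, 16)
)

# The same y ~ x model that geom_smooth(method = "lm") fits
trend <- lm(num_fires ~ fire_year, data = toy_yearly)

# The slope is the estimated change in number of fires per year
slope <- unname(coef(trend)["fire_year"])
slope
```

On the real summary table, `summary(trend)` would also show how uncertain that slope is given the year-to-year variation.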


### Fires per day

```{r}
fires_small %>%
  group_by(discovery_date) %>%
  summarise(num_fires = n()) %>%
  ggplot +
  aes(x = discovery_date, y = num_fires) +
  geom_line(col = "dark blue")

```

**This shows a typical time series plot with a cyclic variation due to warmer weather in the summer time.**


### Fires per month

```{r}
fires_small %>%
  mutate(year_month = make_date(fire_year, discovery_moy)) %>%
  group_by(year_month) %>%
  summarise(num_fires = n()) %>%
  ggplot +
  aes(x = year_month, y = num_fires) +
  geom_line(col = "dark blue")
```

**Peaks are still shown to be occurring in the summer. The 2006 heatwave is especially visible.**


### Fires by day of year


```{r}
fires_small %>%
  group_by(discovery_doy) %>%
  summarise(num_fires = n()) %>%
  ggplot(aes(x = discovery_doy, y = num_fires)) +
  geom_line(col = "dark blue")
```

**There are peaks around day 60-110 and a big peak around 180.**

#### Checking the data to see where the peak occurs

```{r}
fires_small %>%
  group_by(discovery_doy) %>%
  summarise(num_fires = n()) %>%
  arrange(desc(num_fires))
```
**The 2 highest days of the year are 185 and 186, which correspond to Independence Day (4th July) in a normal year and a leap year respectively.  So I imagine most of the extra fires (over double the normal amount) are caused by fireworks.**
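
One way to test the fireworks guess rather than assume it would be to count the recorded causes on just those two days. A minimal sketch with invented rows (`toy_fires` is a hypothetical stand-in for `fires_small`, so the counts here say nothing about the real data):

```{r}
library(dplyr)

# Invented rows, for illustration only
toy_fires <- tibble(
  discovery_doy    = c(185, 185, 185, 186, 186, 100, 100),
  stat_cause_descr = c("Fireworks", "Fireworks", "Lightning",
                       "Fireworks", "Campfire", "Debris Burning", "Arson")
)

# Causes recorded on days 185 and 186 only, most frequent first
peak_causes <- toy_fires %>%
  filter(discovery_doy %in% c(185, 186)) %>%
  count(stat_cause_descr, sort = TRUE, name = "num_fires")

peak_causes
```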



### Fires by month of year

```{r}
fires_small %>%
  mutate(discovery_moy = (month(ymd(discovery_date), label = TRUE))) %>%
  group_by(discovery_moy) %>%
  summarise(num_fires = n()) %>%
  ggplot(aes(x = discovery_moy, y = num_fires)) +
  geom_col(fill = "dark blue", col = "white")
```

**There are 2 definite peaks during the year.  The March and April peak is probably due to the US "Spring Break", when schools and universities are closed, so families are likely to be on vacation, possibly visiting National Parks.  July and August are also the Summer Break for schools, with both families visiting parks and hot weather being likely causes of fire outbreaks.**




### Fires by cause

```{r}
options(scipen = 999)

fires_small %>%
  group_by(stat_cause_descr) %>%
  summarise(num_fires = n()) %>%
  ggplot +
  aes(x = reorder(stat_cause_descr, num_fires), y = num_fires) +
  geom_col(fill = "dark blue") + 
  coord_flip() 
```


### Fire avg size by cause

```{r}
fires_small %>%
  group_by(stat_cause_descr) %>%
  summarise(avg_size = mean(fire_size)) %>%
  ggplot +
  aes(x = reorder(stat_cause_descr, avg_size), y = avg_size) +
  geom_col(fill = "dark blue") + 
  coord_flip()
```


### Avg burn time by cause

```{r}
fires_small %>%
  summarise(num_na = sum(is.na(cont_date)))
```
*Literally half the data is missing for burn time, making it very difficult to do any meaningful analysis*
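
Before giving up on burn time entirely, it may be worth checking whether the missing containment dates are spread evenly or concentrated in particular causes. A minimal sketch with invented rows (`toy_burn` is a hypothetical stand-in for `fires_small`):

```{r}
library(dplyr)

# Invented rows; NA means no containment date was recorded
toy_burn <- tibble(
  stat_cause_descr = c("Arson", "Arson", "Lightning", "Lightning"),
  cont_date        = c(2453137.5, NA, NA, NA)
)

# mean() of a logical vector gives the share of TRUE values, i.e. the share missing
missing_by_cause <- toy_burn %>%
  group_by(stat_cause_descr) %>%
  summarise(prop_missing = mean(is.na(cont_date)), .groups = "drop")

missing_by_cause
```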



### Fires by size


```{r}
fires_small %>%
  group_by(fire_size_class) %>%
  summarise(num_fires = n()) %>%
  ggplot +
  aes(x = fire_size_class, y = num_fires, fill = fire_size_class) +
  geom_col() +
  scale_fill_manual(values = c("red", "orange", "yellow", "green", "blue", 
                               "purple", "black"),
                    name = "Fire Size Classification",
                    breaks = c("A", "B", "C", "D", "E", "F", "G"),
                    labels = c("A: < 1/4 acre", "B: 1/4 to 10 acres", "C: 10 to 100 acres",
                               "D: 100 to 300 acres", "E: 300 to 1000 acres",
                               "F: 1000 to 5000 acres", "G: More than 5000 acres"))

```



# Geo Spatial wrangling 


### To make it easier to visually compare the frequency of wildfires between states I want to display it in a map format.  As I'm using ggplot2 already I'm going to also use it for maps with the `geom_polygon()` and `coord_map()` functions, along with `theme_map()` from ggthemes.


#### I'm not entirely sure what geo-spatial information is being held within the sqlite database file; I've made a few attempts to retrieve it but have been unsuccessful.  Therefore I'm going to utilise ggplot2's `map_data()` function, which provides coordinates for state boundaries taken from the `maps` package.


```{r}
# State boundary co-ordinates via ggplot2's map_data(), which needs the 'maps' package installed

state_map <- map_data("state")
state_map
```


#### Annoyingly it doesn't have the abbreviation of the State, only the full name, so I need to add that in.  Luckily the `datasets` package has vectors of State names and abbreviations, so I shall make a tibble with them both in.


```{r}
state.abb
```

```{r}
state.name
```

```{r}
state_list <- tibble(state = state.abb, state_name = state.name)
state_list
```


#### The `state_map` dataframe is in lower case and has the column name 'region'.  I shall change the `state_list` tibble to be the same format so they can be joined together.


```{r}
state_list <- tibble(state = state.abb, region = tolower(state.name))
```


#### Joining `state_list` to `fires_small` datasets

```{r}
fires_states <- fires_small %>%
  left_join(state_list, by = "state")

fires_states
```


#### Checking the join has worked and there are no missing values.

```{r}
fires_states %>%
  filter(is.na(region))
```


#### There are 22,147 NAs in the 'region' column we just made.  Scrolling through, the 2 codes 'PR' and 'DC' are missing from the `state_list` tibble.

#### After some quick research it seems that there are only 50 States in the US. Washington DC is technically not counted as a state but as a Federal District, as it is the seat of government, which is why it wasn't included in the `state_list` tibble originally.  PR is Puerto Rico, which is also not a state but the largest US territory.
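
Rather than scrolling through the NAs, the unmatched codes can be listed directly with `anti_join()`, which keeps only the rows that have no match in the lookup table. A minimal sketch with invented rows (`toy_fires` and `state_lookup` are hypothetical names):

```{r}
library(dplyr)

# Invented rows standing in for the state column of fires_small
toy_fires <- tibble(state = c("CA", "DC", "PR", "TX", "DC"))

# The 50-state lookup built from the datasets package
state_lookup <- tibble(
  state  = datasets::state.abb,
  region = tolower(datasets::state.name)
)

# Rows of toy_fires with no matching state code, counted per code
unmatched <- toy_fires %>%
  anti_join(state_lookup, by = "state") %>%
  count(state, name = "num_fires")

unmatched
```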


#### I shall add DC and PR into the state_list and re-join it.

```{r}
# Adding 2 new states

state.abb <- append(state.abb, c("DC", "PR"))
state.name <- append(state.name, c("District of Columbia", "Puerto Rico"))

state_list <- tibble(state = state.abb, region = tolower(state.name))
```


```{r}
# Re-joining tibbles

fires_states <- fires_small %>%
  left_join(state_list, by = "state")
```


```{r}
# Checking the join has worked properly and there are no NAs

fires_states %>%
  filter(is.na(region))
```

```{r}
# Code below brings up a "vector memory exhausted (limit reached?)" error

# fires_joined <- fires_states %>%
#  right_join(state_map, by = "region")
```


#### The data set and geo information are too big to join, so I'm going to summarise first to get the number of fires per region.

```{r}
fires_joined <- fires_states %>% 
    select(region) %>%
    group_by(region) %>%
    summarise(num_fires = n()) %>%
    right_join(state_map, by = "region")
```

**Result!!  Now doing first geo spatial visualisation**
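
The likely reason summarising first helps is that the map data has many boundary points per region, so joining the raw fires repeats every fire once for every boundary point of its state. A minimal sketch of that row multiplication with invented data (`toy_fires` and `toy_map` are hypothetical names):

```{r}
library(dplyr)

# Invented data: 4 fires in one region, 3 boundary points for that region
toy_fires <- tibble(region = rep("alabama", 4))
toy_map <- tibble(
  region = rep("alabama", 3),
  long   = c(-88, -87, -86),
  lat    = c(31, 32, 33)
)

# Joining raw fires to the map gives fires x boundary points rows.
# Newer dplyr versions may warn about this many-to-many match, so it is silenced here.
rows_raw <- nrow(suppressWarnings(right_join(toy_fires, toy_map, by = "region")))

# Summarising first leaves one row per region, so the result stays map-sized
rows_summarised <- toy_fires %>%
  count(region, name = "num_fires") %>%
  right_join(toy_map, by = "region") %>%
  nrow()

c(raw = rows_raw, summarised = rows_summarised)
```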


### Total Wildfires per state from 1992-2015

```{r}
fires_joined %>% 
    ggplot +
    (aes(x = long, y = lat, group = group, fill = num_fires)) + 
    geom_polygon() + 
    geom_path(color = "white") + 
    scale_fill_continuous(low = "darkblue", 
                          high = "darkred",
                          name = "Number of fires") + 
    theme_map() + 
    coord_map("mollweide") + 
    ggtitle("Total US Wildfires from 1992-2015") + 
    theme(plot.title = element_text(hjust = 0.5))
```


# 3. Geo Spatial Visualisations

### The dataset has a cause of fire column so I shall now create some causation plots.


#### Getting list of fire causes

```{r}
fires_states %>%
  distinct(stat_cause_descr) %>%
  arrange(stat_cause_descr)
```


### Total fire by cause in tabular form

```{r}
fires_states %>%
  select(stat_cause_descr) %>%
  group_by(stat_cause_descr) %>%
  summarise(num_fires = n ()) %>%
  arrange(desc(num_fires))
  
```

### Number of fires by state in tabular form

```{r}
fires_states %>%
  select(region) %>%
  group_by(region) %>%
  summarise(num_fires = n()) %>%
  arrange(desc(num_fires))
```


#### As the cause needs to be filtered before the map join, I'm either going to have to repeat a whole load of the same code in every single plot or write a function that will do it for me, saving a lot of typing!

```{r}
# Function for plotting cause of fire

cause <- function(cause) {
  fires_states %>%
    filter(stat_cause_descr == cause) %>%
    select(region) %>%
    group_by(region) %>%
    summarise(num_fires = n()) %>%
    right_join(state_map, by = "region") %>%
    ggplot +
    (aes(x = long, y = lat, group = group, fill = num_fires)) + 
    geom_polygon() + 
    geom_path(color = "white") + 
    scale_fill_continuous(low = "darkblue", 
                          high = "darkred",
                          name = "Number of fires") + 
    theme_map() + 
    coord_map("mollweide") + 
    ggtitle(paste0("Total US Wildfires caused by ", cause, " from 1992-2015")) + 
    theme(plot.title = element_text(hjust = 0.5))
}
```



### Wildfires caused by Arson

```{r}
cause("Arson")
```

**Arson does seem more prevalent in the SE states of Mississippi, Georgia, Alabama and also the western state of California.**


### Wildfires caused by Campfire

```{r}
cause("Campfire")
```

**Campfires are the most prevalent in the Western states of Oregon, California and Arizona.**


### Wildfires caused by Children

```{r}
cause("Children")
```

**Fires caused by children are spread across the country, but the states with the most are California in the west and Alabama, South Carolina and New Jersey in the east.**


### Wildfires caused by Debris Burning

```{r}
cause("Debris Burning")
```

**Fires caused by debris burning are mostly in the warmer southern states of Texas, Georgia and North Carolina.**

### Wildfires caused by Equipment Use

```{r}
cause("Equipment Use")
```

**Most of the fires caused by equipment seem to be in California.**


### Wildfires caused by Fireworks

```{r}
cause("Fireworks")
```

**Most of the fires caused by fireworks seem to be in the north of the country, primarily South Dakota, Montana and Washington state.**


### Wildfires caused by Lightning

```{r}
cause("Lightning")
```

**Apart from a hotspot in Florida, the vast majority of fires caused by lightning are in the west of the country, with the 3 most affected states being California, Oregon and Arizona.**

### Wildfires caused by Miscellaneous

```{r}
cause("Miscellaneous")
```

**There seem to be quite a few miscellaneous classifications in California, Texas and New York.**


### Wildfires caused by Missing/Undefined

```{r}
cause("Missing/Undefined")
```

**The states with the most missing or undefined causes are North and South Carolina, Oklahoma and California.**


### Wildfires caused by Powerline

```{r}
cause("Powerline")
```

**Texas has the largest number of wildfires caused by powerlines.  This may be due to the warm climate and the large proportion of the state that is dry grassland used for agriculture.**


### Wildfires caused by Railroad

```{r}
cause("Railroad")
```

**Florida has by far the most wildfires caused by railroads.**


### Wildfires caused by Smoking

```{r}
cause("Smoking")
```

**Fires caused by smoking seem to be spread around the country, but mainly on the east and west coasts.**


### Wildfires caused by Structure

```{r}
cause("Structure")
```

**South Dakota has the largest number of fires caused by structures.**



#### Unsurprisingly the southern states seem to have more occurrences of wildfire in general, probably due to the warmer climate at their latitudes.  Also, the states with the highest and third-highest number of fires are among the largest states by area.  However, the second-highest state is Georgia, which, although it is in the south of the country, is only an average-sized state.  Therefore, to get a better picture of what is going on, I'm going to look at the number of fires per square mile, normalising by state area.

#### The `datasets` package also includes the area of each state in square miles, in the `state.area` vector.

```{r}
state.area
```

```{r}
length(state.area)
```

#### Annoyingly it also only has 50 entries, not 52, so I will need to add DC and PR back in.

(Area figures obtained from Wikipedia)

DC = 68 miles^2
PR = 3515 miles^2


```{r}
# To make my life easier I'm going to rebuild the lookup vectors from the original `datasets` versions and make the tibble again, adding in the land area figures at the same time to make sure they are in the correct order.

# Using datasets:: bypasses the extended copies made earlier, so this chunk gives the same result however many times it is run.
state.abb <- append(datasets::state.abb, c("DC", "PR"))
state.name <- append(datasets::state.name, c("District of Columbia", "Puerto Rico"))
state.area <- append(datasets::state.area, c(68, 3515))

state_list <- tibble(state = state.abb, region = tolower(state.name), area = state.area)
```

```{r}
# Re-joining tibbles

fires_states <- fires_small %>%
  left_join(state_list, by = "state")
```


### Normalising by state area

```{r}
fires_states %>%
  select(region, area) %>%
  group_by(region, area) %>%
  summarise(num_fires = n()) %>%
  mutate(fires_sqmile = num_fires / area) %>%
  arrange(desc(fires_sqmile))
```

#### This table shows Puerto Rico has the highest number of fires per square mile, followed by New Jersey in the NE of the country and then by the states in the SE of the country.
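
As a small worked example of the normalisation, the sketch below divides fire counts by area for two states. The areas come from the built-in `datasets` vectors, but the fire counts in `toy_counts` are made-up numbers chosen only to show that a state with fewer fires can still rank higher per square mile.

```{r}
library(dplyr)
library(tibble)

# Made-up fire counts for illustration only (not taken from the dataset)
toy_counts <- tibble(state = c("TX", "NJ"), num_fires = c(1000, 500))

# Real areas in square miles from the built-in datasets vectors
areas <- tibble(state = datasets::state.abb, area = datasets::state.area)

# Divide each count by the state's area and rank by the resulting density
density_ranked <- toy_counts %>%
  left_join(areas, by = "state") %>%
  mutate(fires_sqmile = num_fires / area) %>%
  arrange(desc(fires_sqmile))

density_ranked
```

Texas has twice as many fires in this toy input, yet New Jersey comes first because its area is so much smaller.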


```{r}
fires_states %>%
  select(region, area) %>%
  group_by(region, area) %>%
  summarise(num_fires = n()) %>%
  mutate(fires_sqmile = num_fires / area) %>%
  right_join(state_map, by = "region") %>%
  ggplot +
  (aes(x = long, y = lat, group = group, fill = fires_sqmile)) + 
  geom_polygon() + 
  geom_path(color = "white") + 
  scale_fill_distiller(name = "Fires per Sq Mile", palette = "PuBuGn") +
  theme_map() + 
  coord_map("mollweide") + 
  ggtitle("Total US Wildfires per Square Mile from 1992-2015") + 
  theme(plot.title = element_text(hjust = 0.5))
```

**Puerto Rico is not shown on this map, but visually we can see the data for the other 51 entries, and the south-eastern states still have the highest number of wildfires per square mile.  Interestingly, New Jersey also shows up as a hotspot in the NE of the country.**


### Do causes change over time?


#### Splitting causes into 2 groups for legibility. 

#### The first group is for fires caused directly by people.

```{r}
fires_states %>%
  select(stat_cause_descr, fire_year) %>%
  group_by(fire_year, stat_cause_descr) %>%
  filter(stat_cause_descr %in% c("Arson", "Campfire", "Children",
                                 "Equipment Use", "Fireworks", "Smoking")) %>%
  summarise(num_fires = n()) %>%
  ggplot +
  aes(x = fire_year, y = num_fires, colour = stat_cause_descr) +
  geom_line()
```

**The 2 large peaks in Arson in 1999 and 2006 are obvious. There was a large heatwave in 2006, but I'm not sure why this would result in an increase in arson, unless it was just due to dry ground providing extra fuel and helping fires spread that would not normally have become large.  This may also be why there is a peak in 2006 for Equipment Use.  Arson does, however, look to have been decreasing since 2006.**


#### The second group covers the remaining causes, including naturally occurring fires.

```{r}
fires_states %>%
  select(stat_cause_descr, fire_year) %>%
  group_by(fire_year, stat_cause_descr) %>%
  filter(stat_cause_descr %in% c("Debris Burning", "Lightning", "Miscellaneous",
                                 "Missing/Undefined", "Powerline", "Railroad",
                                 "Structure")) %>%
  summarise(num_fires = n()) %>%
  ggplot +
  aes(x = fire_year, y = num_fires, colour = stat_cause_descr) +
  geom_line()
```

**Similar peaks can be seen in debris burning, miscellaneous and lightning during the 2006 heatwave, which left the ground very dry.  There are also peaks from 1997 to 2003 in debris burning, miscellaneous and lightning, alongside a trough in missing/undefined, so this is likely to be due to more accurate classification of fires rather than use of the missing/undefined category.**



### Difference in causes between states


```{r}
state_map_southern <- state_map %>%
  filter(region %in% c("florida", "georgia", "alabama", "mississippi",
                       "south carolina", "north carolina", "tennessee",
                       "arkansas", "louisiana"))
```


```{r}
fires_states %>%
  filter(fire_year %in% 1992:1995) %>%
  filter(region %in% c("florida", "georgia", "alabama", "mississippi",
                       "south carolina", "north carolina", "tennessee",
                       "arkansas", "louisiana")) %>%
  select(region, stat_cause_descr) %>%
  group_by(region, stat_cause_descr) %>%
  summarise(num_fire = n()) %>%
  top_n(1) %>%
  right_join(state_map_southern, by = "region") %>%
  ggplot +
  (aes(x = long, y = lat, group = group, fill = stat_cause_descr)) + 
  geom_polygon() + 
  geom_path(color = "white") + 
  theme_map() + 
  scale_fill_brewer(name = "Cause of Fires", palette = "PuBuGn") +
  ggtitle("Total US Wildfires main cause from 1992-1995") + 
  theme(plot.title = element_text(hjust = 0.5))
```


```{r}
fires_states %>%
  filter(fire_year %in% 1996:1999) %>%
  filter(region %in% c("florida", "georgia", "alabama", "mississippi",
                       "south carolina", "north carolina", "tennessee",
                       "arkansas", "louisiana")) %>%
  select(region, stat_cause_descr) %>%
  group_by(region, stat_cause_descr) %>%
  summarise(num_fire = n()) %>%
  top_n(1) %>%
  right_join(state_map_southern, by = "region") %>%
  ggplot +
  (aes(x = long, y = lat, group = group, fill = stat_cause_descr)) + 
  geom_polygon() + 
  geom_path(color = "white") + 
  theme_map() + 
  scale_fill_brewer(name = "Cause of Fires", palette = "PuBuGn") +
  ggtitle("Total US Wildfires main cause from 1996-1999") + 
  theme(plot.title = element_text(hjust = 0.5))
```

```{r}
fires_states %>%
  filter(fire_year %in% 2000:2003) %>%
  filter(region %in% c("florida", "georgia", "alabama", "mississippi",
                       "south carolina", "north carolina", "tennessee",
                       "arkansas", "louisiana")) %>%
  select(region, stat_cause_descr) %>%
  group_by(region, stat_cause_descr) %>%
  summarise(num_fire = n()) %>%
  top_n(1) %>%
  right_join(state_map_southern, by = "region") %>%
  ggplot +
  (aes(x = long, y = lat, group = group, fill = stat_cause_descr)) + 
  geom_polygon() + 
  geom_path(color = "white") + 
  theme_map() + 
  scale_fill_brewer(name = "Cause of Fires", palette = "PuBuGn") +
  ggtitle("Total US Wildfires main cause from 2000-2003") + 
  theme(plot.title = element_text(hjust = 0.5))
```


```{r}
fires_states %>%
  filter(fire_year %in% 2004:2007) %>%
  filter(region %in% c("florida", "georgia", "alabama", "mississippi",
                       "south carolina", "north carolina", "tennessee",
                       "arkansas", "louisiana")) %>%
  select(region, stat_cause_descr) %>%
  group_by(region, stat_cause_descr) %>%
  summarise(num_fire = n()) %>%
  top_n(1) %>%
  right_join(state_map_southern, by = "region") %>%
  ggplot +
  (aes(x = long, y = lat, group = group, fill = stat_cause_descr)) + 
  geom_polygon() + 
  geom_path(color = "white") + 
  theme_map() + 
  scale_fill_brewer(name = "Cause of Fires", palette = "PuBuGn") +
  ggtitle("Total US Wildfires main cause from 2004-2007") + 
  theme(plot.title = element_text(hjust = 0.5))
```


```{r}
fires_states %>%
  filter(fire_year %in% 2008:2011) %>%
  filter(region %in% c("florida", "georgia", "alabama", "mississippi",
                       "south carolina", "north carolina", "tennessee",
                       "arkansas", "louisiana")) %>%
  select(region, stat_cause_descr) %>%
  group_by(region, stat_cause_descr) %>%
  summarise(num_fire = n()) %>%
  top_n(1) %>%
  right_join(state_map_southern, by = "region") %>%
  ggplot +
  (aes(x = long, y = lat, group = group, fill = stat_cause_descr)) + 
  geom_polygon() + 
  geom_path(color = "white") + 
  theme_map() + 
  scale_fill_brewer(name = "Cause of Fires", palette = "PuBuGn") +
  ggtitle("Total US Wildfires main cause from 2008-2011") + 
  theme(plot.title = element_text(hjust = 0.5))
```



```{r}
fires_states %>%
  filter(fire_year %in% 2012:2015) %>%
  filter(region %in% c("florida", "georgia", "alabama", "mississippi",
                       "south carolina", "north carolina", "tennessee",
                       "arkansas", "louisiana")) %>%
  select(region, stat_cause_descr) %>%
  group_by(region, stat_cause_descr) %>%
  summarise(num_fire = n()) %>%
  top_n(1) %>%
  right_join(state_map_southern, by = "region") %>%
  ggplot +
  (aes(x = long, y = lat, group = group, fill = stat_cause_descr)) + 
  geom_polygon() + 
  geom_path(color = "white") + 
  theme_map() + 
  scale_fill_brewer(name = "Cause of Fires", palette = "PuBuGn") +
  ggtitle("Total US Wildfires main cause from 2012-2015") + 
  theme(plot.title = element_text(hjust = 0.5))
```

**Looking at these trends, some interesting insights can be seen.  In the combined-years data Florida stands out as having railroad as its main cause of wildfire, but the plots above show that railroad is only the main cause up to the 4-year period ending in 2003; the main cause then changes to lightning until the end of the collection period in 2015.  Similarly, arson appears as the most common cause in several southern states until 2007, after which it no longer does.  This downward trend was also noted earlier in the overall causation plots for all states.**
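
The six period chunks above differ only in their years, so the data step could be wrapped in a function, in the same spirit as the earlier per-cause helper. The sketch below is one possible way to do it, not the code used for the plots: `main_cause_by_region()` is a hypothetical name, `toy_fires` is made-up data, and `slice_max()` is swapped in for `top_n()`.

```{r}
library(dplyr)
library(tibble)

# Hypothetical helper: most common cause per region for a set of years
main_cause_by_region <- function(fires, years, regions) {
  fires %>%
    filter(fire_year %in% years, region %in% regions) %>%
    count(region, stat_cause_descr, name = "num_fire") %>%
    group_by(region) %>%
    # Keep the cause(s) with the highest count in each region (ties are kept)
    slice_max(num_fire, n = 1) %>%
    ungroup()
}

# Made-up records for illustration only
toy_fires <- tibble(
  fire_year = c(1992, 1993, 1993, 1994, 1995, 1995, 1996),
  region = c("florida", "florida", "florida", "georgia", "georgia",
             "georgia", "florida"),
  stat_cause_descr = c("Railroad", "Railroad", "Lightning", "Arson", "Arson",
                       "Debris Burning", "Lightning")
)

main_cause <- main_cause_by_region(toy_fires, 1992:1995,
                                   c("florida", "georgia"))
main_cause
```

Applied to `fires_states`, the result could be piped into the same `right_join(state_map_southern, by = "region")` and ggplot layers used in the chunks above, with only the year range and title changing between plots.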



